Customer Segmentation in the US
In this project, the goal is to identify households that have difficulty getting credit
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats.mstats import trimmed_var
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
The dataset tracks behaviors relating to the ways households earn, save, and spend money in the United States
# Reading the gzip-csv file into a DataFrame
df = pd.read_csv("SCFP2019.csv.gz")
print("df shape:", df.shape)
df.head()
df shape: (28885, 351)
| | YY1 | Y1 | WGT | HHSEX | AGE | AGECL | EDUC | EDCL | MARRIED | KIDS | ... | NWCAT | INCCAT | ASSETCAT | NINCCAT | NINC2CAT | NWPCTLECAT | INCPCTLECAT | NINCPCTLECAT | INCQRTCAT | NINCQRTCAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 11 | 6119.779308 | 2 | 75 | 6 | 12 | 4 | 2 | 0 | ... | 5 | 3 | 6 | 3 | 2 | 10 | 6 | 6 | 3 | 3 |
| 1 | 1 | 12 | 4712.374912 | 2 | 75 | 6 | 12 | 4 | 2 | 0 | ... | 5 | 3 | 6 | 3 | 1 | 10 | 5 | 5 | 2 | 2 |
| 2 | 1 | 13 | 5145.224455 | 2 | 75 | 6 | 12 | 4 | 2 | 0 | ... | 5 | 3 | 6 | 3 | 1 | 10 | 5 | 5 | 2 | 2 |
| 3 | 1 | 14 | 5297.663412 | 2 | 75 | 6 | 12 | 4 | 2 | 0 | ... | 5 | 2 | 6 | 2 | 1 | 10 | 4 | 4 | 2 | 2 |
| 4 | 1 | 15 | 4761.812371 | 2 | 75 | 6 | 12 | 4 | 2 | 0 | ... | 5 | 3 | 6 | 3 | 1 | 10 | 5 | 5 | 2 | 2 |
5 rows × 351 columns
The dictionary of this dataset can be found at https://sda.berkeley.edu/sdaweb/docs/scfcomb2019/DOC/hcbkfx0.htm
The goal of this project is to identify households that have difficulty getting credit, so an interesting column to keep an eye on is:
TURNFEAR | Household has been turned down for credit or feared being denied credit in the past 5 years
# Using a mask to subset df to only households that have been turned down for credit or feared being turned down
mask = df["TURNFEAR"] == 1
df_fear = df[mask]
print("df_fear shape:", df_fear.shape)
df_fear.head()
df_fear shape: (4623, 351)
| | YY1 | Y1 | WGT | HHSEX | AGE | AGECL | EDUC | EDCL | MARRIED | KIDS | ... | NWCAT | INCCAT | ASSETCAT | NINCCAT | NINC2CAT | NWPCTLECAT | INCPCTLECAT | NINCPCTLECAT | INCQRTCAT | NINCQRTCAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 2 | 21 | 3790.476607 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
| 6 | 2 | 22 | 3798.868505 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 3 | 2 | 2 |
| 7 | 2 | 23 | 3799.468393 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
| 8 | 2 | 24 | 3788.076005 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
| 9 | 2 | 25 | 3793.066589 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
5 rows × 351 columns
Another useful feature is age; the age-related columns are `AGE` and `AGECL`.
When working with segmentation, a good approach is to give priority to features that group records together. Although `AGECL` is numerical, it is in fact a categorical feature
# Available age groups
age_groups = df_fear["AGECL"].unique()
print("Age Groups:", age_groups)
Age Groups: [3 5 1 2 4 6]
df_fear["AGECL"].head(10)
5      3
6      3
7      3
8      3
9      3
110    5
111    5
112    5
113    5
114    5
Name: AGECL, dtype: int64
Looking at the dictionary, there is a description of what each number represents:

# Replacing numeric categories with readable ones
agecl_dict = {
    1: "Under 35",
    2: "35-44",
    3: "45-54",
    4: "55-64",
    5: "65-74",
    6: "75 or Older",
}
age_cl = df_fear["AGECL"].replace(agecl_dict)
age_cl.head(10)
5      45-54
6      45-54
7      45-54
8      45-54
9      45-54
110    65-74
111    65-74
112    65-74
113    65-74
114    65-74
Name: AGECL, dtype: object
# Value counts to have a sense of how the age distribution is presented
age_cl_value_counts = age_cl.value_counts(normalize=True)
# Bar plot of `age_cl_value_counts`
age_cl_value_counts.plot(kind="bar")
plt.xlabel("Age Group")
plt.ylabel("Frequency (%)")
plt.title("Credit Fearful: Age Groups");
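One detail worth noting: value_counts sorts by frequency, so the bars above appear in frequency order rather than age order. To show the groups chronologically, the counts can be reindexed with the dictionary values. A minimal sketch with made-up frequencies (the real notebook would reindex `age_cl_value_counts` directly):

```python
import pandas as pd

# Toy normalized counts, sorted by frequency (as value_counts returns them)
counts = pd.Series(
    {"Under 35": 0.30, "45-54": 0.25, "35-44": 0.20, "65-74": 0.15, "55-64": 0.10},
    name="frequency",
)

# Desired chronological order, matching the agecl_dict values
age_order = ["Under 35", "35-44", "45-54", "55-64", "65-74", "75 or Older"]

# Reindex so the bars would appear in age order; missing groups become NaN
ordered = counts.reindex(age_order)
print(ordered)
# ordered.plot(kind="bar") would now show the groups chronologically
```

Groups absent from the counts simply show up as NaN rather than raising an error.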
The AGECL feature gives some insight, but looking at `AGE` directly is also a good idea when exploring the dataset
# Plot histogram of "AGE"
df_fear["AGE"].hist(bins=10)
plt.xlabel("Age")
plt.ylabel("Frequency (count)")
plt.title("Credit Fearful: Age Distribution");
Although the largest group is Under 35, the most common ages seem to fall between 30 and 40 years old
Another useful set of features relates to income:
To better visualize the information in the data, it helps to create a DataFrame showing the normalized frequency of income categories for both the credit-fearful and non-credit-fearful households in the dataset
# Income percentile groups dictionary
inccat_dict = {
    1: "0-20",
    2: "21-39.9",
    3: "40-59.9",
    4: "60-79.9",
    5: "80-89.9",
    6: "90-100",
}
# Data filtering and construction
df_inccat = (
    df["INCCAT"]
    .replace(inccat_dict)
    .groupby(df["TURNFEAR"])
    .value_counts(normalize=True)
    .rename("frequency")
    .sort_values(ascending=False)
    .to_frame()
    .reset_index()
)
df_inccat
| | TURNFEAR | INCCAT | frequency |
|---|---|---|---|
| 0 | 0 | 90-100 | 0.297296 |
| 1 | 1 | 0-20 | 0.288125 |
| 2 | 1 | 21-39.9 | 0.256327 |
| 3 | 1 | 40-59.9 | 0.228856 |
| 4 | 0 | 60-79.9 | 0.174841 |
| 5 | 0 | 40-59.9 | 0.143146 |
| 6 | 0 | 0-20 | 0.140343 |
| 7 | 0 | 21-39.9 | 0.135933 |
| 8 | 1 | 60-79.9 | 0.132598 |
| 9 | 0 | 80-89.9 | 0.108441 |
| 10 | 1 | 90-100 | 0.048886 |
| 11 | 1 | 80-89.9 | 0.045209 |
# Create bar chart of "df_inccat"
sns.barplot(
    x="INCCAT",
    y="frequency",
    hue="TURNFEAR",
    data=df_inccat,
    order=list(inccat_dict.values())
)
# Update axis and title details
plt.xlabel("Income Category")
plt.ylabel("Frequency (%)")
plt.title("Income Distribution: Credit Fearful vs. Non-fearful")
# Put the legend out of the figure
plt.legend(title='TURNFEAR', bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.);
Comparing the income categories across the fearful and non-fearful groups, it is noticeable that credit fearful households are much more common in the lower income categories. In other words, the credit fearful have lower incomes.
So far, based on the data exploration, among the people who said they feared being denied credit after having been turned down in the past five years, the young and the low-income make up the largest share of respondents. This makes sense, since young people tend to make less money and rely more heavily on credit to get their lives off the ground, so having been denied credit makes them more anxious about the future.
Establishing relationships among the variables and building some correlation matrices is a good way to make more sense of the data
# Correlation ASSET X HOUSES in the whole dataset
asset_house_corr = df["ASSET"].corr(df["HOUSES"])
print("Asset Houses Correlation:", asset_house_corr)
Asset Houses Correlation: 0.5198273544779253
A moderate positive correlation, which is expected since, usually, the most valuable asset a household owns is its property
# Correlation ASSET X HOUSES in the "df_fear" dataset
asset_house_corr = df_fear["ASSET"].corr(df_fear["HOUSES"])
print("Credit Fearful Asset Houses Correlation:", asset_house_corr)
Credit Fearful Asset Houses Correlation: 0.583287973597916
The correlation is even stronger when using only the credit-fearful records
A few other features are also interesting to investigate
# Selecting all important features discovered so far
cols = ["ASSET", "HOUSES", "INCOME", "DEBT", "EDUC"]
# Checking their correlation on the whole dataset
corr = df[cols].corr()
corr.style.background_gradient(axis=None)
| | ASSET | HOUSES | INCOME | DEBT | EDUC |
|---|---|---|---|---|---|
| ASSET | 1.000000 | 0.519827 | 0.622429 | 0.261250 | 0.116673 |
| HOUSES | 0.519827 | 1.000000 | 0.247852 | 0.266661 | 0.169300 |
| INCOME | 0.622429 | 0.247852 | 1.000000 | 0.114646 | 0.069400 |
| DEBT | 0.261250 | 0.266661 | 0.114646 | 1.000000 | 0.054179 |
| EDUC | 0.116673 | 0.169300 | 0.069400 | 0.054179 | 1.000000 |
# Checking their correlation in the "df_fear" dataset
corr = df_fear[cols].corr()
corr.style.background_gradient(axis=None)
| | ASSET | HOUSES | INCOME | DEBT | EDUC |
|---|---|---|---|---|---|
| ASSET | 1.000000 | 0.583288 | 0.722074 | 0.474658 | 0.113536 |
| HOUSES | 0.583288 | 1.000000 | 0.264099 | 0.962629 | 0.160348 |
| INCOME | 0.722074 | 0.264099 | 1.000000 | 0.172393 | 0.133170 |
| DEBT | 0.474658 | 0.962629 | 0.172393 | 1.000000 | 0.177386 |
| EDUC | 0.113536 | 0.160348 | 0.133170 | 0.177386 | 1.000000 |
There are important differences. The relationship between DEBT and HOUSES is positive in both datasets, but while the coefficient for the whole dataset is fairly weak at 0.26, the same coefficient for df_fear is 0.96
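To find which relationships shift most between the full dataset and the credit-fearful subset, one option is to subtract the two correlation matrices. A small sketch on synthetic data (the column names and coefficients here are made up; in the notebook it would be `df_fear[cols].corr() - df[cols].corr()`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic "full" dataset: x and y only weakly related
full = pd.DataFrame({"x": rng.normal(size=500)})
full["y"] = 0.2 * full["x"] + rng.normal(size=500)

# Synthetic "subset" where the relationship is much stronger
sub = pd.DataFrame({"x": rng.normal(size=200)})
sub["y"] = 0.95 * sub["x"] + 0.1 * rng.normal(size=200)

# The difference of the correlation matrices highlights the biggest shifts
shift = sub.corr() - full.corr()
print(shift.round(2))
```

Large positive entries flag pairs, like DEBT and HOUSES here, whose relationship tightens inside the subset.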

To get a sense of the data, it helps to create a DataFrame showing the normalized frequency of education categories for both the credit-fearful and non-credit-fearful households. It will be similar in format to df_inccat but focus on education; since the label descriptions are quite long, I will not use a replacement dictionary this time
# Data filtering and construction
df_educ = (
    df["EDUC"]
    .groupby(df["TURNFEAR"])
    .value_counts(normalize=True)
    .rename("frequency")
    .sort_values(ascending=False)
    .to_frame()
    .reset_index()
)
df_educ.head()
| | TURNFEAR | EDUC | frequency |
|---|---|---|---|
| 0 | 1 | 8 | 0.293532 |
| 1 | 0 | 12 | 0.257481 |
| 2 | 1 | 9 | 0.212200 |
| 3 | 0 | 8 | 0.192029 |
| 4 | 0 | 13 | 0.149823 |
# Create bar chart of "df_educ"
sns.barplot(
    x="EDUC",
    y="frequency",
    hue="TURNFEAR",
    data=df_educ
)
plt.xlabel("Education Level")
plt.ylabel("Frequency (%)")
plt.title("Educational Attainment: Credit Fearful vs. Non-fearful");
In this plot, it is noticeable that a much higher proportion of credit-fearful respondents have only a high school diploma, while university degrees are more common among the non-credit-fearful. In fact, the trend only reverses once the participant has at least a Bachelor's degree (Education Level 12)
Investigating DEBT x ASSET
# Creating a scatter plot of ASSET vs DEBT using the whole dataset
df.plot.scatter(x="DEBT", y="ASSET");
# Create scatter plot of ASSET vs DEBT using "df_fear"
df_fear.plot.scatter(x="DEBT", y="ASSET");
The relationship in the df_fear plot is flatter than the one in the df plot, but the two are clearly different
Investigating DEBT x HOUSES
# Create scatter plot of HOUSES vs DEBT using the whole dataset
df.plot.scatter(x="DEBT", y="HOUSES");
# Create scatter plot of HOUSES vs DEBT using "df_fear"
df_fear.plot.scatter(x="DEBT", y="HOUSES");
There are outliers, but the relationship is clear enough: the df_fear plot shows an almost perfectly linear relationship, while the df plot is more muddled. It is also noticeable that the data points in the df_fear plot form several small groups
After all the insights obtained through the data analysis, it is time to start building the segmentation model. However, 351 features is quite a large number. A good practice for selecting clustering features is to determine which numerical features have the largest variance
# Checking the variance, getting 10 largest features
top_ten_var = df.var().sort_values().tail(10)
top_ten_var
EQUITY      5.674332e+14
FIN         1.054968e+15
ACTBUS      2.246149e+15
KGBUS       2.398037e+15
KGTOTAL     2.805429e+15
BUS         3.184224e+15
NHNFIN      3.606887e+15
NFIN        3.764771e+15
NETWORTH    6.142046e+15
ASSET       6.275796e+15
dtype: float64
# Create horizontal bar chart of `top_ten_var`
fig = px.bar(
    x=top_ten_var,
    y=top_ten_var.index,
    title="SCF: High Variance Features"
)
fig.update_layout(xaxis_title="Variance", yaxis_title="Feature")
fig.show()
df.ASSET.hist();
df.NETWORTH.hist();
The dataset is massively right-skewed because of the huge outliers on the right side of the distribution, so a treatment is definitely in order
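The skew can also be quantified numerically with scipy's skew function. A toy illustration (synthetic numbers, not the SCF data) of how a few extreme values inflate the skewness:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Roughly symmetric values for most of the sample
body = rng.normal(loc=50_000, scale=15_000, size=1_000)

# A few extreme, net-worth-style outliers on the right
outliers = np.array([5_000_000, 20_000_000, 80_000_000])

combined = np.concatenate([body, outliers])
print("skew without outliers:", round(float(skew(body)), 2))
print("skew with outliers:   ", round(float(skew(combined)), 2))
```

A skewness near 0 indicates a symmetric distribution; the handful of outliers pushes it far to the right, which is what the ASSET and NETWORTH histograms show.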
I will create a wrangle function to help organize the code and clean the data:
def wrangle(filepath):
    """
    Read a gzip-compressed CSV file into a DataFrame and clean it.

    Parameters
    ----------
    filepath : str
        Path of the gzip-compressed CSV file

    Returns
    -------
    pandas.DataFrame
        Credit-fearful households with a net worth below $2 million
    """
    # Reading
    df = pd.read_csv(filepath)
    # Masking the credit-fearful records with NETWORTH below $2M
    mask = (df["TURNFEAR"] == 1) & (df["NETWORTH"] < 2_000_000)
    # Applying the mask
    df = df[mask]
    return df
df_fear = wrangle("SCFP2019.csv.gz")
print(df_fear.shape)
df_fear.head()
(4418, 351)
| | YY1 | Y1 | WGT | HHSEX | AGE | AGECL | EDUC | EDCL | MARRIED | KIDS | ... | NWCAT | INCCAT | ASSETCAT | NINCCAT | NINC2CAT | NWPCTLECAT | INCPCTLECAT | NINCPCTLECAT | INCQRTCAT | NINCQRTCAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 2 | 21 | 3790.476607 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
| 6 | 2 | 22 | 3798.868505 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 3 | 2 | 2 |
| 7 | 2 | 23 | 3799.468393 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
| 8 | 2 | 24 | 3788.076005 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
| 9 | 2 | 25 | 3793.066589 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
5 rows × 351 columns
df_fear.NETWORTH.hist();
# Create a boxplot of `NETWORTH`
fig = px.box(
    data_frame=df_fear,
    x="NETWORTH",
    title="Distribution of Household Net Worth"
)
fig.update_layout(xaxis_title="Value [$]")
fig.show()
Although this is much better, there are still some values that might distort a feature's total variance. That is why I will use the trimmed variance, where extreme values are removed before the variance is calculated.
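Before applying it to the whole DataFrame, a quick toy example (made-up values) shows how trimming tames the influence of a single extreme value:

```python
import numpy as np
from scipy.stats.mstats import trimmed_var

# Mostly moderate values plus one extreme outlier
values = np.array([10, 12, 11, 13, 9, 14, 10, 12, 11, 1_000])

plain = values.var()                              # the outlier dominates
trimmed = trimmed_var(values, limits=(0.1, 0.1))  # drop the top/bottom 10% first

print("plain variance:  ", round(float(plain)))
print("trimmed variance:", round(float(trimmed)))
```

With the outlier trimmed away, the variance drops from tens of thousands to single digits, reflecting the spread of the typical values.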
# Calculate trimmed variance
top_ten_trim_var = df_fear.apply(trimmed_var, limits=(0.1, 0.1)).sort_values().tail(10)
top_ten_trim_var
WAGEINC     5.550737e+08
HOMEEQ      7.338377e+08
NH_MORT     1.333125e+09
MRTHEL      1.380468e+09
PLOAN1      1.441968e+09
DEBT        3.089865e+09
NETWORTH    3.099929e+09
HOUSES      4.978660e+09
NFIN        8.456442e+09
ASSET       1.175370e+10
dtype: float64
# Create horizontal bar chart of "top_ten_trim_var"
fig = px.bar(
    x=top_ten_trim_var,
    y=top_ten_trim_var.index,
    title="SCF: High Variance Features"
)
fig.update_layout(xaxis_title="Trimmed Variance", yaxis_title="Feature")
fig.show()
There are a few things to notice in this plot. First, the variances have decreased a lot: in the previous chart the x-axis went up to $80 billion, while this one goes up to $12 billion. Second, the top 10 features have changed a bit; all the features relating to business ownership (ending in "BUS") are gone. Finally, there are big differences in variance from feature to feature. For example, the variance for WAGEINC is around $500 million, while the variance for `ASSET` is nearly $12 billion. In other words, these features have completely different scales, which needs to be addressed before good clusters can be made.
All in all, the first five features seem to have the largest variance: the sixth is around half of the fifth, and from there on they are all similar. For that reason, choosing the first five to build the model seems to be the optimal decision
high_var_cols = top_ten_trim_var.tail().index.to_list()
high_var_cols
['DEBT', 'NETWORTH', 'HOUSES', 'NFIN', 'ASSET']
# Creating the feature matrix X
X = df_fear[high_var_cols]
print("X shape:", X.shape)
X.head()
X shape: (4418, 5)
| | DEBT | NETWORTH | HOUSES | NFIN | ASSET |
|---|---|---|---|---|---|
| 5 | 12200.0 | -6710.0 | 0.0 | 3900.0 | 5490.0 |
| 6 | 12600.0 | -4710.0 | 0.0 | 6300.0 | 7890.0 |
| 7 | 15300.0 | -8115.0 | 0.0 | 5600.0 | 7185.0 |
| 8 | 14100.0 | -2510.0 | 0.0 | 10000.0 | 11590.0 |
| 9 | 15400.0 | -5715.0 | 0.0 | 8100.0 | 9685.0 |
During EDA, a scale issue among the features was noticed. That issue can make the data harder to cluster, so it needs to be fixed before further analysis. A good strategy is standardization
# Creating a DataFrame with the mean and standard deviation for all the features in X
X_summary = X.aggregate(["mean", "std"]).astype(int)
X_summary
| | DEBT | NETWORTH | HOUSES | NFIN | ASSET |
|---|---|---|---|---|---|
| mean | 72701 | 76387 | 74530 | 117330 | 149089 |
| std | 135950 | 220159 | 154546 | 239038 | 288166 |
# Instantiate transformer
ss = StandardScaler()
# Transform "X"
X_scaled_data = ss.fit_transform(X)
# Put "X_scaled_data" into DataFrame
X_scaled = pd.DataFrame(X_scaled_data, columns=X.columns)
print("X_scaled shape:", X_scaled.shape)
X_scaled.head()
X_scaled shape: (4418, 5)
| | DEBT | NETWORTH | HOUSES | NFIN | ASSET |
|---|---|---|---|---|---|
| 0 | -0.445075 | -0.377486 | -0.48231 | -0.474583 | -0.498377 |
| 1 | -0.442132 | -0.368401 | -0.48231 | -0.464541 | -0.490047 |
| 2 | -0.422270 | -0.383868 | -0.48231 | -0.467470 | -0.492494 |
| 3 | -0.431097 | -0.358407 | -0.48231 | -0.449061 | -0.477206 |
| 4 | -0.421534 | -0.372966 | -0.48231 | -0.457010 | -0.483818 |
Now all five features use the same scale. To verify, look at their mean and standard deviation:
# Creating a DataFrame with the mean and standard deviation for all the features in X_scaled
X_scaled_summary = X_scaled.aggregate(["mean", "std"]).astype(int)
X_scaled_summary
| | DEBT | NETWORTH | HOUSES | NFIN | ASSET |
|---|---|---|---|---|---|
| mean | 0 | 0 | 0 | 0 | 0 |
| std | 1 | 1 | 1 | 1 | 1 |
Everything looks as expected, since standardization rescales every feature to have a mean of 0 and a standard deviation of 1.
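As a sanity check, the same result can be reproduced by hand, since StandardScaler simply computes (x - mean) / std per column, using the population standard deviation (ddof=0). A small sketch on toy data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy features on very different scales
X_toy = pd.DataFrame({
    "small": [1.0, 2.0, 3.0, 4.0],
    "big": [10_000.0, 20_000.0, 30_000.0, 40_000.0],
})

# Manual standardization: subtract the mean, divide by the population std
manual = (X_toy - X_toy.mean()) / X_toy.std(ddof=0)

# StandardScaler performs the same computation column by column
scaled = pd.DataFrame(StandardScaler().fit_transform(X_toy), columns=X_toy.columns)

print(np.allclose(manual, scaled))  # True
```

Note the ddof=0: pandas defaults to the sample standard deviation (ddof=1), so without it the two results would differ slightly.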
Using a for loop to build and train a K-Means model where n_clusters ranges from 2 to 12 (inclusive). Each time a model is trained, the inertia is calculated and added to the list inertia_errors, then the silhouette score is calculated and added to the list silhouette_scores.
# List of number of clusters
n_clusters = range(2, 13)
inertia_errors = []
silhouette_scores = []
# Add "for" loop to train model and calculate inertia, silhouette score
for k in n_clusters:
    # Build model
    model = make_pipeline(StandardScaler(), KMeans(n_clusters=k, random_state=42))
    # Train model
    model.fit(X)
    # Calculate inertia
    inertia_errors.append(model.named_steps["kmeans"].inertia_)
    # Calculate silhouette score
    silhouette_scores.append(silhouette_score(X, model.named_steps["kmeans"].labels_))
print("Inertia:", inertia_errors[:3])
print()
print("Silhouette Scores:", silhouette_scores[:3])
Inertia: [11028.058082607182, 7190.5263035753505, 5924.997726868028]

Silhouette Scores: [0.7464502937083215, 0.7044601307791996, 0.6962653079183132]
Using plotly express to create a line plot that shows the values of inertia_errors as a function of n_clusters:
# Creating line plot of `inertia_errors` vs `n_clusters`
fig = px.line(
    x=n_clusters,
    y=inertia_errors,
    title="K-Means Model: Inertia vs Number of Clusters"
)
fig.update_layout(
    xaxis_title="Number of Clusters [k]",
    yaxis_title="Inertia"
)
fig.show()
It is noticeable that the line starts to flatten out around 4 or 5 clusters
Using plotly express to create a line plot that shows the values of silhouette_scores as a function of n_clusters:
# Create a line plot of `silhouette_scores` vs `n_clusters`
fig = px.line(
    x=n_clusters,
    y=silhouette_scores,
    title="K-Means Model: Silhouette Score vs Number of Clusters"
)
fig.update_layout(
    xaxis_title="Number of Clusters",
    yaxis_title="Silhouette Score"
)
fig.show()
This plot may be less straightforward, but putting its information together with the inertia plot, the best setting for n_clusters seems to be 4.
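The selection can also be automated by taking the k with the highest silhouette score. A sketch on synthetic blobs with a known number of clusters (not the SCF data; the centers and seeds are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated clusters (hypothetical centers)
centers = [[0, 0], [5, 5], [0, 5], [5, 0]]
X_toy, _ = make_blobs(n_samples=600, centers=centers, cluster_std=0.5, random_state=42)

candidates = list(range(2, 9))
scores = []
for k in candidates:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_toy)
    scores.append(silhouette_score(X_toy, labels))

# The best k is the one with the highest silhouette score
best_k = candidates[int(np.argmax(scores))]
print("best k by silhouette:", best_k)
```

On well-separated blobs the argmax recovers the true cluster count; on messier data like the SCF, it is still worth eyeballing the full curve alongside the inertia plot rather than trusting the argmax blindly.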
Building and training a new k-means model:
final_model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, random_state=42)
)
final_model.fit(X)
Pipeline(steps=[('standardscaler', StandardScaler()),
('kmeans', KMeans(n_clusters=4, random_state=42))])
Extracting the labels
labels = final_model.named_steps["kmeans"].labels_
print(labels[:5])
[0 0 0 0 0]
xgb = X.groupby(labels).mean()
xgb
| | DEBT | NETWORTH | HOUSES | NFIN | ASSET |
|---|---|---|---|---|---|
| 0 | 26551.075439 | 13676.153182 | 13745.637777 | 2.722605e+04 | 4.022723e+04 |
| 1 | 218112.818182 | 174713.441558 | 257403.246753 | 3.305884e+05 | 3.928263e+05 |
| 2 | 116160.779817 | 965764.155963 | 264339.449541 | 7.800611e+05 | 1.081925e+06 |
| 3 | 732937.575758 | 760397.575758 | 826136.363636 | 1.276227e+06 | 1.493335e+06 |
Using plotly express to create a side-by-side bar chart from xgb that shows the mean of the features in X for each cluster in the final_model:
# Create side-by-side bar chart of `xgb`
fig = px.bar(
    xgb,
    barmode="group",
    title="Mean Household Finances by Cluster"
)
fig.update_layout(xaxis_title="Cluster", yaxis_title="Value [$]")
fig.show()
The clusters are based partially on NETWORTH, which means that the households in cluster 0 have the smallest net worth and the households in cluster 2 have the highest. Based on that, there are some interesting things to unpack.
The value of the debt for households in cluster 0 is higher than the value of their houses, suggesting that their debt cannot all be tied up in mortgages, that is, if they own a home at all. Contrasting that with the other three clusters, the value of everyone else's debt is lower than the value of their homes
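That observation can be checked directly from the cluster means in the xgb table above (values copied and rounded):

```python
import pandas as pd

# Cluster means for DEBT and HOUSES, taken (rounded) from the xgb table
means = pd.DataFrame(
    {
        "DEBT": [26551, 218113, 116161, 732938],
        "HOUSES": [13746, 257403, 264339, 826136],
    },
    index=[0, 1, 2, 3],
)

# A ratio above 1 means average debt exceeds average house value
ratio = (means["DEBT"] / means["HOUSES"]).round(2)
print(ratio)
```

Only cluster 0 has a ratio above 1; in the notebook, the same check would be `xgb["DEBT"] / xgb["HOUSES"]`.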
To represent the clusters in a plot, it is necessary to reduce the dimensionality of the data. This can be achieved with a model such as PCA
Creating a PCA transformer then using it to reduce the dimensionality of the data in X to 2:
# Instantiate transformer
pca = PCA(n_components=2, random_state=42)
# Transform "X"
X_t = pca.fit_transform(X)
# Put "X_t" into DataFrame
X_pca = pd.DataFrame(X_t, columns=["PC1", "PC2"])
print("X_pca shape:", X_pca.shape)
X_pca.head()
X_pca shape: (4418, 2)
| | PC1 | PC2 |
|---|---|---|
| 0 | -221525.424530 | -22052.273003 |
| 1 | -217775.100722 | -22851.358068 |
| 2 | -219519.642175 | -19023.646333 |
| 3 | -212195.720367 | -22957.107039 |
| 4 | -215540.507551 | -20259.749306 |
# Create scatter plot of `PC2` vs `PC1`
fig = px.scatter(
    data_frame=X_pca,
    x="PC1",
    y="PC2",
    color=labels.astype(str),
    title="PCA Representation of Clusters",
)
fig.update_layout(xaxis_title="PC1", yaxis_title="PC2")
fig.show()
The plot shows four tightly grouped clusters, each sharing some key features
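One follow-up worth adding after any PCA projection is checking how much of the original variance the two components retain, via explained_variance_ratio_. A sketch on synthetic correlated data (in the notebook, the same attribute would be read off the fitted pca object directly):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Five correlated features driven by one latent factor, loosely mimicking
# how DEBT/NETWORTH/HOUSES/NFIN/ASSET move together
base = rng.normal(size=(500, 1))
X_toy = base @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(500, 5))

pca = PCA(n_components=2, random_state=42).fit(X_toy)

# Fraction of total variance captured by each principal component
print(pca.explained_variance_ratio_.round(3))
print("total retained:", round(float(pca.explained_variance_ratio_.sum()), 3))
```

A high retained fraction means the 2-D scatter plot is a faithful summary; a low one would mean the clusters might look closer or more mixed than they really are in the full feature space.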